Achievement test reliability as a function of ability was
determined for multiple sections of a large university French
class (n=193). A 5-option multiple-choice examination was
constructed, least attractive distractors were eliminated based on
the instructor's judgment, and the resulting three forms of the
examination (i.e., 3-, 4-, or 5-choice question form) were randomly
assigned to quiz sections with similar mean cumulative grade point
averages. Students were later grouped into high (3.6-4.0), average
(3.1-3.5), and low (0-3.0) ability levels based on their final course
grades in French where B=3.0 and A=4.0. A Kuder-Richardson 20
reliability coefficient was computed for each test form for each
ability group and adjusted by the Spearman-Brown formula. Differences
among reliabilities for the three forms were: (1) significant at
alpha=.05 for the low ability group; (2) not significant for the high
ability group; and (3) significant at alpha=.10 for the average
ability group. The ability groups were combined and differences among
reliabilities for the three forms were significant at alpha=.05. The
optimal number of alternatives for all ability groups combined was
four. (Author/BL)

ABSTRACT



Achievement test reliability as a function of ability was determined for
multiple sections of a large University of Washington French class. Previous
empirical and theoretical papers suggested that reliabilities of tests with
3-option items were as high as or higher than those of tests with 2-, 4-, or
5-option items. Lord (1977), however, has argued that decreasing the number of
options resulted in a more efficient test for high-level examinees but a less
efficient test for low-level examinees. Results of this study did not support
this argument in a classroom situation. An explanation for the discrepancy is
presented.

A number of studies have examined the effects on test reliability of the
number of alternatives presented on multiple-choice items (Ebel, 1969; Grier,
1975; Lord, 1944, 1977). Several theoretical formulations have suggested that
for integer values the 3-choice item allows maximum test reliability (Tversky,
1964; Grier, 1975) with 2-choice items next best (Grier, 1975). A model
assuming knowledge or random guessing was used to algebraically derive the
reliability of scores on a test composed of n equivalent A-choice items. This
approach (Lord, 1977) suggested that 3-choice items are optimum in maximizing
test reliability when difficulty level p equals .5 and item intercorrelations r
equal .2 or .3. Williams and Ebel (1957), Costin (1970, 1972), and Straton and
Catts (1980) empirically found tests composed of 3-option items to be more
reliable than 2-, 4-, and 5-option item tests. Lord (1977), however, using an
item characteristic curve model with data from the College Board Scholastic
Aptitude Test, found fewer options per item to be more efficient for high
ability level examinees but less efficient for low ability level examinees
when the total number of alternatives was held constant. Weber (1978) examined
the effects of number of choices per item on reliabilities of classroom tests.
She concluded that more choices per item yield higher test reliabilities for low
achievers when time and test length are fixed. She compared only 3- vs. 4- and
3- vs. 5-choice tests with small sample sizes (N's=13-28) and short tests (19-20
items) in repeated administration of the tests to the same group.

The present study compared the reliabilities of 3-, 4-, and 5-choice tests for
low, average, and high ability level examinees. It examined whether the result
suggested by Lord would be obtained under typical classroom testing conditions
with use of a quasi-mastery exam, rather than with simulation of expected scores
derived from the SAT as in Lord's study. It also provided an extension and
improvement of Weber's design. Weber used a repeated measures design,
administering two versions of a test to the same group with a time lag between
administrations. This design allows results to be confounded by memory of item
responses or by learning between test administrations. The present study
employed independent groups.

Consistent with Lord (1977) and Weber (1978), the hypotheses for
this study were:

1. Internal consistency reliability coefficients decrease significantly 
as number of options decreases for low-ability level examinees. 

2. Reliabilities increase significantly as number of options decreases 
for high-ability level examinees. 

3. No significant differences exist between reliabilities as number of
options decreases for average-ability examinees. 

METHOD 

Participants in this study were 193 students in nine quiz sections of a
beginning French class at the University of Washington. A 5-option multiple-choice
examination was constructed. Distracters were then systematically eliminated
from each question to form the 4- and 3-choice questions. Elimination of
distracters was based on the instructor's judgment about the least attractive
alternatives. The three forms of the examination were then randomly assigned to
quiz sections (but not to individual students). Differences in mean cumulative
grade point average (obtained from official academic records) among sections were
assessed with a one-way analysis of variance; no significant main effect was
found for quiz section. All students within a given quiz section received the
same test during the eighth week of instruction. Students were given 40 minutes
to complete the 40-item tests. Since all students finished within this time,
speed was not considered to be a factor affecting performance. Students were later
grouped into high, average, and low ability levels based on final course grades
calculated independently of the results of the experimental exam. Grade point
average cut-off points were chosen to provide approximately equal numbers of
students in each ability group. The cut-off points were: high (3.6-4.0), average
(3.1-3.5), and low (0-3.0).

RESULTS 

Table 1 presents the item mean, test mean, and standard deviation for each
test by ability group. A KR-20 reliability coefficient was computed for each
test form for each ability group. To equate the total number of items that could
be given in the time used for a 5-option test, these reliabilities were then
adjusted by the Spearman-Brown formula. This adjustment assumes that total
testing time is proportional to the total number of alternatives, an assumption
which is unlikely to be true for most item types but which is treated here as
given. Adjusted and unadjusted reliability coefficients, number of items with
non-zero variance, and sample sizes are presented in Table 2 for each ability
level and for the combined sample.
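
The two computations described above can be illustrated with a minimal sketch.
The function names are hypothetical, the use of the population variance of
total scores in KR-20 is an assumption (the paper does not say which variance
estimator was used), and the lengthening factor of 5 divided by the number of
options is inferred from the stated assumption that testing time is proportional
to the total number of alternatives. It is not taken from the authors' own
computations.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson 20 for a list of examinee rows of 0/1 item scores.

    Assumption: the variance of total scores is the population variance.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    total_var = pvariance([sum(row) for row in responses])
    # Sum of item variances p*(1-p); items everyone answers alike add zero.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_var)


def spearman_brown(r, k):
    """Projected reliability when a test is lengthened by a factor k."""
    return k * r / (1 + (k - 1) * r)


def adjust_to_five_option_time(r, n_options):
    """Assumed adjustment: a test with fewer options per item could hold
    5 / n_options times as many items in the time of a 5-option test."""
    return spearman_brown(r, 5 / n_options)


# Unadjusted low-ability coefficients from Table 2 (3- and 4-choice forms).
print(round(adjust_to_five_option_time(0.55, 3), 2))  # 0.67
print(round(adjust_to_five_option_time(0.84, 4), 2))  # 0.87
```

Under this reading, a 5-choice form has a lengthening factor of 1, so its
adjusted and unadjusted coefficients coincide.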

(Tables 1 and 2 here)

Differences between reliability coefficients for groups and for test forms
were tested with a statistic developed by Feldt (1969) and extended by Hakstian
and Whalen (1976). The statistic (called "M") provides a test of the null
hypothesis that reliability coefficients associated with k independent samples
are equal and is based on the assumption that the scores on k parallel parts of
a test conform to the assumptions of the two-factor random effects model of
the analysis of variance: (1) a normally distributed population randomly sampled
and (2) homogeneity of variance for the k parts of the test. Simulation studies
suggest the test to be robust and slightly conservative (Hakstian & Whalen, 1976).

Differences among reliabilities for the low ability group for the 3-, 4-, and
5-choice tests were significant at alpha=.05 (M=10, df=2), but the trend of the
reliabilities was clearly not the one hypothesized. In decreasing order of magnitude,
the KR-20's favored the 4-choice test, the 3-choice test, and the 5-choice test.
For the high ability group differences among reliabilities were not significant;
for the average ability group differences were significant at alpha=.10 (M=5.94, df=2).
Both of these last two results were contrary to hypotheses 2 and 3. The ability
groups were combined and differences among reliabilities for the 3-, 4-, and 5-choice
tests compared. These differences were significant at alpha=.05 (M=13.73, df=2). The
optimal number of alternatives for all ability groups combined was four.

Differences in reliabilities among ability levels were also compared for each
of the three tests. Differences were not significant (p>.05) for any of the 3-,
4-, or 5-choice tests.

DISCUSSION 

Results suggest that a relatively easy teacher-made test may not conform to
the theoretically reasonable predictions regarding test reliability for examinees
of varying ability levels. Failure to support Lord's (1977) and Weber's (1978)
results may derive from various factors. Item means on the French tests deviated
from the statistically optimal difficulty level of p=.5 (overall item mean in this
study was p=.78). The items were easier than those used in Lord's and in Weber's
studies (median p = .5 and p = .65, respectively). The differences in item
responses between ability groups may have been lessened since the item set was
relatively easy.

Another difference between Lord's conditions and those in this study was the
range of abilities available to categorize subjects as high, average, or low
ability. Subjects in Lord's study had scaled scores on the 90-item verbal section
of the SAT ranging from 200 to 800. The range of abilities in the French class,
as determined by final grade, was quite narrow: 72% of the class received at least
a 3.0 for a final grade. Instead of presenting a contrast of low versus high
ability, it is likely that this study contrasted moderately high with slightly
higher ability levels on an easy test.

Another condition to consider is that tests were assigned randomly to quiz
sections and not to individual students. Although quiz sections were not found to
differ significantly in cumulative grade point averages, other systematic
differences may have existed between sections.

In short, the conditions of this study differ from the idealized conditions
present in Lord's (1977) study. However, it is suggested that the conditions of
this study--a fairly narrow range of abilities and a test with fairly easy items--
are more representative of the typical classroom test than those in Lord's study,
which dealt with a more difficult test administered nationwide. It is interesting
to note that the number of items with non-zero variance--the number of items a
reliability coefficient is based upon--tended to decrease from the low to high
ability groups. This would suggest that for easy tests, the reliability for
high ability groups may tend to be depressed simply because of reduced variance
among item responses in the high ability groups.

If achievement tests are designed to be relatively easy for a college class
(e.g., p≥.7), it could be argued that items with fewer options would provide more
efficient tests than items with more options. Ability range could probably be
considered as homogeneous and narrow, abilities relative to the tested range being
high. This argument would be supported by those empirical studies finding 3-option
tests preferable to 4- and 5-option tests (e.g., Costin, 1970, 1972; Straton &
Catts, 1980).

Table 1.

Item and Test Means and Standard Deviations by Ability Group*

                      3-choice             4-choice             5-choice
Ability Group       p     X     SD       p     X     SD       p     X     SD

Low                .69   27.6   4.0     .69   27.8   6.5     .70   28.1   3.5
Average            .79   31.5   2.8     .81   32.6   4.3     .75   30.1   3.6
High               .85   33.9   3.0     .87   34.8   3.1     .85   34.0   3.0
Combined Sample    .77   30.9   4.2     .78   31.4   5.8     .77   30.9   4.3

*p was rounded to 2 digits; X and SD were rounded to 1 digit.

Table 2.

Adjusted and Unadjusted Kuder-Richardson 20 Reliability Coefficients
by Ability Group and Number of Alternatives

                          3-choice                      4-choice                      5-choice
Ability          Unadj. Adj.   Sample Items     Unadj. Adj.   Sample Items     Unadj. Adj.   Sample Items
Group            KR-20  KR-20  Size   (s²≠0)    KR-20  KR-20  Size   (s²≠0)    KR-20  KR-20  Size   (s²≠0)

Low               .55    .67    22     40        .84    .87    25     40        .43    .43    18     39
Average           .20    .30    22     38        .74    .78    18     34        .59    .55     5     30
High              .53    .65    20     35        .61    .66    20     33        .52    .52    18
Combined Sample   .66    .76    64     40        .85    .88    63     40        .68    .68    45     39